Modulation spectral features for speech emotion recognition using deep neural networks

Authors

Abstract

This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprise two important cognitive parts: early auditory analysis and cortex-based processing. The early auditory analysis considers spectrogram-based representation, whereas cortex-based processing includes the extraction of temporal modulations from the spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution at the salient low-frequency regions of speech, we find that the CQT-based spectrogram, together with its temporal modulations, provides an enriched representation of emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift invariant and deformation insensitive representation for SER. Our results show that CQT-MSF outperforms standard mel-scale based features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms shift and deformation invariant scattering coefficients, hence showing the importance of joint hand-crafted and self-learned feature extraction instead of reliance on completely hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of the proposed features over SER.
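The pipeline the abstract describes has two stages: a time-frequency spectrogram, followed by a modulation spectrum taken along the time axis of each frequency channel. A minimal sketch of that idea is shown below. It is illustrative only: it uses a plain STFT as a stand-in for the constant-Q transform, a bare FFT of each channel's envelope in place of a proper modulation filterbank, and synthetic signal parameters chosen for the demo.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    # Windowed magnitude STFT; stand-in for the CQT in the paper's pipeline.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

def modulation_spectrum(spec, n_mod=64):
    # Modulation spectrum: FFT along time of each frequency channel's
    # envelope, after removing the per-channel mean (DC component).
    env = spec - spec.mean(axis=1, keepdims=True)
    return np.abs(np.fft.rfft(env, n=n_mod, axis=1))  # (freq, mod-freq)

# Demo: 1 s of a 440 Hz tone, amplitude-modulated at 4 Hz, 8 kHz sampling.
sr = 8000
t = np.arange(sr) / sr
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
S = spectrogram(x)            # shape (129, 61): 129 freq bins, 61 frames
M = modulation_spectrum(S)    # shape (129, 33): 33 modulation-freq bins
print(S.shape, M.shape)
```

With a frame rate of 62.5 Hz (hop of 128 samples at 8 kHz) and 64-point modulation FFT, the 4 Hz amplitude modulation shows up as a peak near modulation bin 4 in the channel containing the 440 Hz carrier.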


Similar articles

Automatic speech emotion recognition using modulation spectral features

In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained using an auditory filterbank and a modulation filterbank for speech analysis, the representation captures both acoustic frequency and temporal modulation ...


Speech Emotion Recognition Using Scalogram Based Deep Structure

Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...


Multimodal Emotion Recognition Using Deep Neural Networks

The change of emotions is a temporally dependent process. In this paper, a Bimodal-LSTM model is introduced to take temporal information into account for emotion recognition with multimodal signals. We extend the implementation of denoising autoencoders and adopt the Bimodal Deep Denoising AutoEncoder model. Both models are evaluated on a public dataset, SEED, using EEG features and eye movement ...


Recognition of Human Emotion in Speech Using Modulation Spectral Features and Support Vector Machines

Automatic recognition of human emotion in speech aims at recognizing the underlying emotional state of a speaker from the speech signal. The area has received rapidly increasing research interest over the past few years. However, designing powerful spectral features for high-performance speech emotion recognition (SER) remains an open challenge. Most spectral features employed in current SER te...


Binary Deep Neural Networks for Speech Recognition

Deep neural networks (DNNs) are widely used in most current automatic speech recognition (ASR) systems. To guarantee good recognition performance, DNNs usually require significant computational resources, which limits their application to low-power devices. Thus, it is appealing to reduce the computational cost while keeping the accuracy. In this work, in light of the success in image recogniti...



Journal

Journal title: Speech Communication

Year: 2023

ISSN: 1872-7182, 0167-6393

DOI: https://doi.org/10.1016/j.specom.2022.11.005